Tweets Language Identification using Feature Weighting
Abstract
This paper describes the language identification method presented at the Twitter Language Identification Workshop (TweetLID-2014). The proposed method represents tweets as weighted character-level trigrams. We employed three different weighting schemes from Text Categorization to obtain a numerical value that represents the relation between each trigram and each language. For each language, we add up the importance of each trigram in the tweet. The tweet's language is then determined by simple majority voting. Finally, we analyze the results.
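The pipeline in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the three weighting schemes, so the three functions below (relative frequency, a TF-ICF variant, and class share) are stand-in assumptions, as are all function names and the toy training data. The sketch also assumes that the majority vote is taken across the three schemes, which the abstract does not state explicitly.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Character-level trigrams of a lower-cased, space-padded string."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train_counts(samples):
    """Count trigrams per language from (text, language) pairs."""
    counts = defaultdict(Counter)
    for text, lang in samples:
        counts[lang].update(trigrams(text))
    return counts


# Three illustrative weighting schemes (assumptions, not the paper's choices).
def w_relative_frequency(gram, lang, counts):
    """Share of the language's trigram occurrences taken by this trigram."""
    total = sum(counts[lang].values())
    return counts[lang][gram] / total if total else 0.0


def w_tf_icf(gram, lang, counts):
    """Relative frequency scaled by an inverse class frequency factor."""
    langs_with_gram = sum(1 for c in counts.values() if gram in c)
    if langs_with_gram == 0:
        return 0.0
    icf = math.log(1 + len(counts) / langs_with_gram)
    return w_relative_frequency(gram, lang, counts) * icf


def w_class_share(gram, lang, counts):
    """Fraction of all occurrences of this trigram that fall in this language."""
    overall = sum(c[gram] for c in counts.values())
    return counts[lang][gram] / overall if overall else 0.0


SCHEMES = (w_relative_frequency, w_tf_icf, w_class_share)


def score(text, counts, weight):
    """For each language, add up the weights of the tweet's trigrams."""
    grams = trigrams(text)
    return {lang: sum(weight(g, lang, counts) for g in grams)
            for lang in list(counts)}


def identify(text, counts, schemes=SCHEMES):
    """Each scheme votes for its top-scoring language; the majority wins.

    Ties are broken by whichever language received a vote first.
    """
    votes = Counter()
    for weight in schemes:
        scores = score(text, counts, weight)
        votes[max(scores, key=scores.get)] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy training data, made up for illustration only.
    toy = [
        ("the weather is nice today", "en"),
        ("this is the best thing", "en"),
        ("el tiempo es muy bueno hoy", "es"),
        ("que es lo mejor para el día", "es"),
    ]
    model = train_counts(toy)
    print(identify("the thing is nice", model))
```

With three voters and more than two languages, a three-way tie is possible; a real system would need an explicit tie-breaking rule, which the abstract does not describe.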
Similar resources
Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion
This thesis focuses on term weighting in short documents. I propose weighting approaches for assessing the importance of terms for three tasks: (1) document categorization, which aims to classify documents such as tweets into categories, (2) keyword extraction, which aims to identify and extract the most important words of a document, and (3) keyword association modeling, which aims to identify...
A Model for Detecting of Persian Rumors based on the Analysis of Contextual Features in the Content of Social Networks
The rumor is a collective attempt to interpret a vague but attractive situation by using the power of words. Therefore, identifying the rumor language can be helpful in identifying it. The previous research has focused more on the contextual information to reply tweets and less on the content features of the original rumor to address the rumor detection problem. Most of the studies have been in...
Short Text Classification Using Deep Representation: A Case Study of Spanish Tweets in Coset Shared Task
Topic identification as a specific case of text classification is one of the primary steps toward knowledge extraction from the raw textual data. In such tasks, words are dealt with as a set of features. Due to high dimensionality and sparseness of feature vector result from traditional feature selection methods, most of the proposed text classification methods for this purpose lack performance...
Classifier Stacking for Native Language Identification
This paper reports our contribution (team WLZ) to the NLI Shared Task 2017 (essay track). We first extract lexical and syntactic features from the essays, perform feature weighting and selection, and train linear support vector machine (SVM) classifiers each on an individual feature type. The output of base classifiers, as probabilities for each class, are then fed into a multilayer perceptron ...
Time-Sensitive Weighting for Microblog Retrieval
We report our system and experiments for the realtime Adhoc task in the 2011 MicroBlog track. Our goal is to develop effective technique to retrieve relevant tweets that have been posted recently. In particular, we propose a time-sensitive term weighting strategy that can favor tweets in hot-discussed time and a document length related weighting method that can favor long tweets which are mor...
Publication date: 2014